Task 1: Vehicle Segmentation¶

Introduction¶

The objective of this task is to cluster vehicles into distinct groups based on their specifications to uncover meaningful patterns. Steps like data cleaning, feature selection, preprocessing, dimensionality reduction, and clustering are performed to organize the data into interpretable segments. Visualizations and evaluations are used to assess the clusters' quality and characteristics.

Background¶

The information presented in this report is gathered from the following sources:

  • Information outlined in the project requirements document;
  • Details provided on Kaggle;
  • Documentation and code traced back through GitHub commits.

Before diving into the analysis, it is essential to understand the nature of the data. This step is critical as it guides actions such as:

  • Making reasonable assumptions about the data;
  • Handling duplicated and missing values;
  • Interpreting and understanding the results of this report.

Data¶

The dataset has the following characteristics:

  • Original source: The data comes from Otomoto.pl, a popular Polish car sales platform. It consists of self-reported information from individuals and agencies. Most fields are filled using dropdown menus, while numeric fields allow users to input their values. The platform also offers unstructured data, such as images and detailed car descriptions, though these are not included in the dataset.

  • Method of collection: The dataset was scraped from the Otomoto.pl website by a student from Warsaw's Military University of Technology as part of their coursework. It represents a snapshot of the platform’s data taken on December 4, 2021.

  • Scope: The dataset includes 208,304 observations across 25 variables.

  • Timeline: The timeline based on the offer publication date ranges from March 26, 2021 to May 5, 2021.

1. Data Preparation¶

1.a. Duplicated and Missing Values¶

Before proceeding with any further work, it is essential to ensure that any duplicate values are removed from the dataset. In a business context, this refers to ads that contain identical information. These duplicates typically arise when the website logic fails to filter out identical listings. Below are some key statistics related to this matter:
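A hedged sketch of how such a duplicate check might look in pandas; the column names and values below are illustrative stand-ins, not the real 25-variable dataset:

```python
import pandas as pd

# Toy stand-in for the scraped listings; the real frame has 208,304 rows.
df = pd.DataFrame({
    "Price": [45000, 45000, 99000, 45000],
    "Mileage_km": [120000, 120000, 30000, 120000],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# Rows identical to an earlier row count as duplicate ads.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(df))  # duplicates found, unique ads remaining
```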

Next, we will evaluate the completeness of the data by measuring the number of non-empty values for each variable. I have categorized the variables into three groups as follows:

  • Green: Fully usable variables.
  • Yellow: Variables with an acceptable level of completeness, where it is reasonable to remove NAs and proceed.
  • Red: Variables with an unacceptable level of completeness, requiring removal.
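One way to sketch this traffic-light bucketing with pandas; the green/yellow thresholds here (95% and 50%) are illustrative assumptions, not values taken from the report:

```python
import pandas as pd

# Toy frame with one column per completeness tier.
df = pd.DataFrame({
    "Price": [1.0, 2.0, 3.0, 4.0],
    "Drive": ["FWD", None, "RWD", None],
    "First_owner": [None, None, None, "yes"],
})

completeness = df.notna().mean()  # share of non-empty values per column

def bucket(share, green=0.95, yellow=0.50):
    # Threshold choices are assumptions for illustration.
    if share >= green:
        return "green"
    if share >= yellow:
        return "yellow"
    return "red"

labels = completeness.apply(bucket)
print(labels)
```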

1.b. Pre-Processing and Feature Engineering¶

At this stage, I will categorize the variables into two groups: numeric and categorical. Based on the variable type, I will perform the following actions:

  • Numeric variables: Pre-process and proceed with all meaningful variables.
  • Categorical variables: Identify variables with a low number of levels and either apply one-hot encoding or transform them into a numeric format. As you may have noticed, I am placing significant emphasis on converting all variables into numeric format, as this is a requirement for certain dimensionality reduction and clustering methods.

For the categorical variables mentioned above, we will focus on those with a relatively low number of levels or classes: Drive, Condition, and Transmission. Variables with few enough levels are straightforward to convert into dummy variables for interpretation:

  • Condition and Transmission: Suitable for one-hot encoding.
  • Drive: Contains too many levels. I may consider combining some levels into a new category called 4x4 (all), but first, I will examine the behavior within the existing classes.

Additionally, some categorical variables can be converted into numeric format for improved usability and comparability:

  • Features: Transform into a numeric variable, Number_of_features.
  • Offer_publication_date: Convert into Days_on_market.

For the numeric variables, there are some quick wins among the transformation activities. I am applying the following:

  • Price: Using the Currency column, convert to Price_in_CAD for improved interpretability.
  • Production_year: Transform into Vehicle_age for easier interpretation.
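The categorical and numeric transformations described above can be sketched on a toy frame as follows; the `|` feature separator, the exchange rates, and the May 5, 2021 snapshot date are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Condition": ["New", "Used", "Used"],
    "Transmission": ["Automatic", "Manual", "Automatic"],
    "Features": ["ABS|Bluetooth|GPS", "ABS", "ABS|GPS"],
    "Offer_publication_date": ["2021-03-26", "2021-04-15", "2021-05-01"],
    "Price": [100000.0, 20000.0, 50000.0],
    "Currency": ["PLN", "PLN", "EUR"],
    "Production_year": [2020, 2010, 2015],
})

# One-hot encode the low-cardinality categoricals.
df = pd.get_dummies(df, columns=["Condition", "Transmission"], drop_first=True)

# Features -> Number_of_features (the "|" separator is an assumption).
df["Number_of_features"] = df["Features"].str.split("|").str.len()

# Offer_publication_date -> Days_on_market, relative to an assumed snapshot date.
snapshot = pd.Timestamp("2021-05-05")
df["Days_on_market"] = (snapshot - pd.to_datetime(df["Offer_publication_date"])).dt.days

# Price -> Price_in_CAD with illustrative (not actual) exchange rates.
rates_to_cad = {"PLN": 0.33, "EUR": 1.45}
df["Price_in_CAD"] = df["Price"] * df["Currency"].map(rates_to_cad)

# Production_year -> Vehicle_age, relative to the snapshot year.
df["Vehicle_age"] = 2021 - df["Production_year"]
print(df[["Number_of_features", "Days_on_market", "Price_in_CAD", "Vehicle_age"]])
```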

Now that we have created all the features and selected the in-scope variables, let's examine two plots:

  • Correlation Matrix: This will allow us to visually assess the relationships between variables.
  • Distribution Plots: This helps identify outliers that may introduce unnecessary noise, potentially obscuring important signals.

It is important to note that the data is self-reported, which means we may question the plausibility of certain values. The variables outlined below have been capped to remove extremely high values that could introduce noise into the data. Some high values were retained as long as the distribution appeared to follow a continuous scale. For capping purposes, the following limits were applied:

  • Mileage_km: Capped at 1,000,000.
  • Doors_number: Capped at 6.
  • Price_in_CAD: Capped at 1,000,000.
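A minimal sketch of the capping step, shown here as clipping values at the limit (dropping the offending rows would be an equally valid reading of "capped"):

```python
import pandas as pd

df = pd.DataFrame({
    "Mileage_km": [150_000, 2_500_000, 90_000],
    "Doors_number": [4, 11, 5],
    "Price_in_CAD": [30_000.0, 5_000_000.0, 12_000.0],
})

# Caps taken from the limits listed above.
caps = {"Mileage_km": 1_000_000, "Doors_number": 6, "Price_in_CAD": 1_000_000}
for col, cap in caps.items():
    df[col] = df[col].clip(upper=cap)
print(df.max())
```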

2. Dimensionality Reduction¶

Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of datasets. It enhances interpretability while minimizing information loss. The process involves two key steps:

  • Identifying the Number of Principal Components: This step determines how many components are needed to explain a significant portion of the variance in the data. The most common approach is to examine a scree plot or cumulative variance plot to identify the optimal number of components.

  • Performing PCA: After determining the appropriate number of components, PCA is applied to transform the data into the new set of components.

The scree plot above illustrates how the explained variance changes as the number of principal components increases. To determine the optimal number of components, we aim for a cumulative explained variance between 80% and 90%. This decision is somewhat subjective; in this case, 80%–90% cumulative explained variance corresponds to a range of 5 to 7 principal components. I am comfortable selecting 5 principal components.
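The component-selection step can be sketched with scikit-learn; the matrix below is synthetic, standing in for the prepared numeric data, so the resulting count will differ from the report's 5:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # stand-in for the prepared numeric matrix
X[:, 1] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=500)  # induce correlation

# Standardize, fit PCA, and accumulate the explained variance ratios.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 80% cumulative explained variance.
n_components = int(np.argmax(cumvar >= 0.80)) + 1
print(n_components, cumvar[:n_components])
```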

Using PCA not only improves the efficiency of clustering techniques by reducing computation time but also helps mitigate random noise in the data and makes clusters more distinguishable.

3. Clustering¶

At this stage, I am applying two clustering techniques: K-Means and DBSCAN. In the preceding sections, I prepared the data by reducing its dimensionality and ensuring that only numeric values are supplied. Each method relies on specific parameters that are relatively sensitive and must be carefully tuned.

3.a. K-Means Clustering¶

K-Means clustering is a method used to group similar data points into a preset number of clusters. It works by iteratively updating the center (centroid) of each group and assigning every data point to its nearest centroid. Based on the elbow method, k = 6 appears to be the optimal number of clusters.
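A sketch of the elbow method on synthetic data (the report applies it to the PCA scores instead): inertia always decreases as k grows, and the "elbow" is where the drop flattens.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known structure, standing in for the PCA scores.
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.0, random_state=42)

# Fit K-Means for a range of k and record the within-cluster inertia.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

print([round(i) for i in inertias])  # look for the flattening point
```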

3.b. DBSCAN Clustering¶

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points based on density, assigning points in low-density areas as noise. There are two key parameters that need to be tuned:

  • Eps: The maximum distance between two points for them to be considered neighbors.
  • Min_samples: The minimum number of points required to form a dense cluster.

Similar to the approach used for k-means clustering, we can estimate the Eps value from the K-distance graph using the elbow method. For Min_samples, the general recommendation is to consider the dimensionality of the data and choose a value around twice the number of variables. Based on these steps, the recommended values are:

  • Eps: 0.5
  • Min_Samples: 30
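A sketch of both tuning aids on synthetic data; here `min_samples=10` is illustrative and, unlike the report's value of 30, not derived from this dataset's dimensionality:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.4, random_state=0)

# K-distance graph: sorted distance to the k-th nearest neighbor;
# the elbow in this curve suggests a value for Eps.
min_samples = 10
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Fit DBSCAN with the chosen parameters; label -1 marks noise points.
db = DBSCAN(eps=0.5, min_samples=min_samples).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, int((labels == -1).sum()))
```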

4. Cluster Evaluation¶

Before proceeding with the evaluation, let's briefly review some of the assumptions of the methods mentioned earlier to determine if any are being violated and to select the most theoretically suitable method for our data:

K-Means:

  • Spherical clusters: Best suited for clusters that are round or ball-shaped.
  • Similar cluster sizes: Prefers clusters that are roughly equal in size.
  • No noise or outliers: Assumes all data points belong to a cluster.
  • Even density: Assumes clusters have a similar density of points.

DBSCAN:

  • Dense clusters: Groups points based on high-density regions rather than shape.
  • Noise exists: Capable of identifying outliers or points that do not belong to any cluster.
  • No fixed cluster size: Can accommodate clusters of varying sizes and shapes.
  • Varying density: Works well with clusters that have different densities.

The performance of each method will be evaluated using:

  • Silhouette Scores: A metric that assesses how well values within each cluster are grouped.
  • Practical Considerations: The feasibility of accommodating the resulting number of clusters or profiles.

The Silhouette Score measures clustering quality by assessing how well data points fit within their assigned clusters compared to others. Scores range from -1 to 1, with higher values indicating well-separated and cohesive clusters, while lower scores suggest overlapping or incorrect assignments. It balances intra-cluster cohesion with inter-cluster separation, making it valuable for comparing clustering methods and identifying the optimal number of clusters.
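Computing the score with scikit-learn is a one-liner; the data below is synthetic, so the value will differ from the report's 0.286 and 0.320:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic clusters, standing in for the PCA scores.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

score = silhouette_score(X, labels)  # in [-1, 1]; higher = better separation
print(round(score, 3))
```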

4.a. K-Means Clustering¶

The K-Means clustering with K = 6 yields a low Silhouette Score of 0.286, indicating suboptimal clustering performance. This score reflects poor separation between clusters and significant overlaps, resulting in ambiguity in data point assignments.

Upon observing the scatter plot:

  • Some clusters, such as red and purple, appear distinct.
  • Others, like blue, green, yellow and orange, show considerable overlap, suggesting unclear boundaries.
  • Additionally, cluster sizes are uneven, with one cluster (yellow) being disproportionately larger, which may skew the results.

4.b. DBSCAN Clustering¶

The DBSCAN clustering results yield a Silhouette Score of 0.320, which, while slightly higher than the K-Means score, still indicates moderate clustering quality.

From the scatter plot:

  • The red points labeled as cluster -1 represent noise or outliers, effectively identified by DBSCAN—a notable advantage over K-Means.
  • However, the remaining clusters are not entirely well-separated, with overlaps observed among clusters such as green, blue, and yellow.
  • The results highlight the density-based nature of DBSCAN, as smaller, denser clusters (e.g., yellow, purple) are distinguishable, while more spread-out points are marked as noise.

5. Interpretation and Insights¶

5.a. K-Means Clustering Results¶

The Silhouette Score of 0.286 indicates low clustering quality, with poor separation and potential overlaps between clusters. While K-Means provides an interpretable solution with six distinct clusters, it assumes spherical shapes and equal variance, which may not apply to the data. Despite these limitations, K-Means is preferred for profiling due to its simpler interpretation and fewer clusters compared to DBSCAN.

Let’s use profiling to validate the characteristics of the clusters based on some of the original features:

  • Cluster 0: Consists entirely of used cars with manual transmission. This is the largest cluster, primarily made up of vehicles with lower horsepower and some of the lowest prices.
  • Cluster 1: Contains used cars with automatic transmission, typically newer models (lower vehicle age).
  • Cluster 2: Includes all new cars, both manual and automatic, with the second-highest prices in the market.
  • Cluster 3: Comprises entirely automatic, older cars with the highest mileage and some of the lowest prices. These vehicles take longer to sell.
  • Cluster 4: Consists mostly of automatic, older cars, featuring the highest horsepower and price range.
  • Cluster 5: Contains older, almost entirely automatic cars with the lowest horsepower, often among the oldest models. These vehicles also take longer to sell.

5.b. DBSCAN Clustering Results¶

The DBSCAN clustering results show slightly better performance, with a Silhouette Score of 0.320 compared to 0.286 for K-Means. It effectively identifies noise and outliers, as seen with cluster -1 in the scatter plot. However, the larger number of clusters makes interpretation more challenging and less practical for profiling. While DBSCAN’s density-based approach offers better separation in some areas, it complicates the generation of actionable insights for business purposes.